library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to RStudio. To do that, you should read the instructions (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. It is displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
Use the garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.
garden_harvest %>%
mutate(day_of_week = wday(date, label = TRUE)) %>%
group_by(vegetable, day_of_week) %>%
# weight is recorded in grams; convert to pounds as the question asks
summarise(total_harvest = sum(weight) * 0.00220462) %>%
pivot_wider(id_cols = vegetable,
names_from = day_of_week,
values_from = total_harvest)
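As a minimal sketch of the wide layout this exercise asks for, the block below repeats the same steps on a tiny made-up tibble. The name toy_harvest and its values are hypothetical stand-ins, not the real garden_harvest data.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data standing in for garden_harvest (weights in grams)
toy_harvest <- tibble(
  vegetable = c("beans", "beans", "kale", "kale"),
  date      = as.Date(c("2020-07-06", "2020-07-07", "2020-07-06", "2020-07-13")),
  weight    = c(100, 200, 50, 150)
)

toy_wide <- toy_harvest %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%
  group_by(vegetable, day_of_week) %>%
  # total per vegetable and weekday, converted from grams to pounds
  summarise(total_lbs = sum(weight) * 0.00220462, .groups = "drop") %>%
  # one row per vegetable, one column per weekday that appears in the data
  pivot_wider(id_cols = vegetable,
              names_from = day_of_week,
              values_from = total_lbs)

toy_wide
```

A vegetable that was never harvested on a given weekday gets NA in that column; pivot_wider() has a values_fill argument if a 0 would read better.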
Use the garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?
The problem is in the garden_planting dataset: some varieties are recorded in two different plots (for example, Chinese Red Noodle is in plots K and L), so the join returns one row per plot and repeats that variety’s total harvest. To fix it, we could combine the plots for each variety into a single row before joining, or filter garden_planting so that only one of the plots remains.
garden_harvest %>%
group_by(vegetable, variety) %>%
summarise(total_harvest = sum(weight)*0.00220462) %>%
left_join(garden_planting,
by = c("vegetable", "variety"))
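The duplication can be reproduced on a small scale. The sketch below uses made-up tables (harvest_totals and planting are hypothetical names, and the weights are invented) to show the repeated row and one possible fix, which is to collapse the plots into one row per variety before joining.

```r
library(dplyr)

# Hypothetical stand-ins for the summarised harvest and for garden_planting
harvest_totals <- tibble(
  vegetable = c("beans", "kale"),
  variety   = c("Chinese Red Noodle", "Heirloom Lacinto"),
  total_lbs = c(0.8, 5.9)
)
planting <- tibble(
  vegetable = c("beans", "beans", "kale"),
  variety   = c("Chinese Red Noodle", "Chinese Red Noodle", "Heirloom Lacinto"),
  plot      = c("K", "L", "front")
)

# Naive join: the beans total appears twice, once per plot
naive <- harvest_totals %>%
  left_join(planting, by = c("vegetable", "variety"))

# One possible fix: one row per variety, with its plots pasted together
plots_per_variety <- planting %>%
  group_by(vegetable, variety) %>%
  summarise(plots = paste(sort(unique(plot)), collapse = ", "),
            .groups = "drop")

fixed <- harvest_totals %>%
  left_join(plots_per_variety, by = c("vegetable", "variety"))
```

Here naive has three rows while fixed has two, with "K, L" recorded for the beans. Pasting the plots together keeps all of the planting information; filtering to a single plot would be simpler but discards some of it.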
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
To understand how much money you saved, you can left join the garden_harvest data with the garden_spending dataset by vegetable and variety to see what each harvested vegetable cost, and convert the harvested weight to pounds. We don’t need all of the variables, so we would use select() to keep only vegetable, variety, weight, and price in the table. Then, to compare with a grocer like Whole Foods, we would left join their price-per-pound data onto the table we already have. To compare the prices exactly, we would multiply the weight harvested from your garden for each variety by the Whole Foods price per pound and compare that amount with what was spent.
garden_harvest %>%
left_join(garden_spending,
by = c("vegetable", "variety")) %>%
mutate(wt_lbs = weight*0.00220462) %>%
select(vegetable, variety, wt_lbs, price)
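A rough sketch of the comparison described above, under stated assumptions: the three tibbles, the grocer_prices table, its price_per_lb column, and every number in them are invented for illustration and do not come from gardenR or from any real store.

```r
library(dplyr)

# Hypothetical harvest totals in pounds, one row per variety
harvest_lbs <- tibble(
  vegetable = c("tomatoes", "kale"),
  variety   = c("Big Beef", "Heirloom Lacinto"),
  wt_lbs    = c(20, 5)
)
# Hypothetical seed or plant cost per variety
spending <- tibble(
  vegetable = c("tomatoes", "kale"),
  variety   = c("Big Beef", "Heirloom Lacinto"),
  price     = c(3.5, 2)
)
# Hypothetical grocery prices, one row per vegetable
grocer_prices <- tibble(
  vegetable    = c("tomatoes", "kale"),
  price_per_lb = c(2.5, 4)
)

savings <- harvest_lbs %>%
  left_join(spending, by = c("vegetable", "variety")) %>%
  left_join(grocer_prices, by = "vegetable") %>%
  mutate(store_value = wt_lbs * price_per_lb,  # cost of buying the same weight
         saved       = store_value - price)    # minus what the garden cost
```

Summarising the harvest to one row per variety before joining matters here, because joining the raw daily harvest rows would repeat the seed price on every row.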
garden_harvest %>%
filter(vegetable %in% "tomatoes") %>%
mutate(first_date = fct_reorder(variety, date, min)) %>%
group_by(variety, first_date) %>%
summarise(total_wt_lbs = sum(weight)*0.00220462) %>%
ggplot(aes(x = total_wt_lbs, y = first_date)) +
labs(title = "Total weight of each tomato variety harvested",
x = "",
y = "") +
geom_col()
In the garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
garden_harvest %>%
mutate(variety_lowercase = str_to_lower(variety),
variety_length = str_length(variety)) %>%
arrange(vegetable, variety_length) %>%
distinct(vegetable, variety, variety_length, .keep_all = TRUE)
In the garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
garden_harvest %>%
# no spaces around |: a space inside the pattern is matched literally
filter(str_detect(variety, "er|ar")) %>%
distinct(vegetable, variety)
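Whitespace inside a regular expression is part of the pattern, so "er | ar" looks for "er" followed by a space or "ar" preceded by a space. The short check below, on a hypothetical vector of variety-like names, shows how much that changes the result.

```r
library(stringr)

# Hypothetical variety names, chosen only to exercise the two patterns
varieties <- c("Cherokee Purple", "Bolero", "garlic", "Dragon", "Better Boy")

# "er" or "ar" anywhere in the name
loose <- str_detect(varieties, "er|ar")
# → TRUE TRUE TRUE FALSE TRUE

# With spaces: only "er " or " ar" match
spaced <- str_detect(varieties, "er | ar")
# → FALSE FALSE FALSE FALSE TRUE
```

Only "Better Boy" matches the spaced pattern, because it is the only name where "er" is followed by a space.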
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
A density plot of the events versus sdate. Use geom_density().
Trips %>%
ggplot(aes(x=sdate)) +
geom_density(fill = "blue") +
labs(title = "Density plot of events versus sdate",
x = "",
y = "") +
theme(axis.text.y = element_blank())
Use mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time_of_day = hour+min) %>%
ggplot(aes(x = time_of_day)) +
geom_density(fill = "blue") +
labs(title = "Density plot of events versus the time of day",
x = "",
y = "") +
theme(axis.text.y = element_blank())
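The decimal time-of-day conversion from the hint can be checked on a few hand-picked timestamps. The vector starts below is made up for illustration and is not taken from Trips.

```r
library(lubridate)

# Hypothetical start times standing in for sdate
starts <- ymd_hms(c("2014-10-01 03:30:00",
                    "2014-10-01 03:45:00",
                    "2014-10-01 17:15:00"))

# Hour of the day plus the fraction of the hour that has passed
time_of_day <- hour(starts) + minute(starts) / 60
# → 3.50 3.75 17.25
```

Seconds are ignored in this version; adding second(starts) / 3600 would include them, which makes little visible difference in a density plot.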
Trips %>%
mutate(day = (wday(sdate, label = TRUE))) %>%
ggplot(aes(y = day)) +
labs(title = "Barplot of events versus the day of the week",
x = "",
y = "") +
geom_bar()
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time_of_day = hour+min,
day_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day)) +
geom_density(fill = "blue") +
theme(axis.text.y = element_blank()) +
labs(title = "Density plot of events versus the time of day per day of the week",
x = "",
y = "") +
facet_wrap(vars(day_of_week))
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
Set the fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time_of_day = hour+min,
day_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = client), color = NA, alpha = .5) +
theme(axis.text.y = element_blank()) +
labs(title = "Density plot of events versus the time of day per day of the week",
x = "",
y = "") +
facet_wrap(vars(day_of_week))
Add position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
In my opinion, these graphs are worse in terms of telling a story. The previous graphs are easier to read because the casual and registered clients overlap and the transparency lets you see both. Here the graphs seem to tell a different story altogether. Because the two types of clients are stacked on top of one another, it looks like there are always more casual clients than registered clients; that is not the case, which is clearer in the previous exercise.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time_of_day = hour+min,
day_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day)) +
geom_density(position = position_stack(), aes(fill = client), color = NA, alpha = .5) +
theme(axis.text.y = element_blank()) +
labs(title = "Density plot of events versus the time of day per day of the week",
x = "",
y = "") +
facet_wrap(vars(day_of_week))
Go back to the regular density plot (without position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time_of_day = hour+min,
day_of_week = wday(sdate, label = TRUE)) %>%
mutate(weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "weekend", "weekday")) %>%
# same density of time of day, filled by client, as in the previous problem
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = client), color = NA, alpha = .5) +
labs(title = "Density plot of events versus the time of day, weekdays vs the weekend",
x = "",
y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(weekend))
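Comparing against the labels "Sat" and "Sun" relies on an English locale. As a hedged alternative, the sketch below builds the same weekend flag from the numeric weekday, using a few made-up dates in place of sdate.

```r
library(lubridate)

# Hypothetical dates: a Friday, Saturday, Sunday, and Monday in October 2014
dates <- as.Date(c("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06"))

# With week_start = 7, wday() returns 1 for Sunday and 7 for Saturday
day_number <- wday(dates, week_start = 7)

weekend <- ifelse(day_number %in% c(1, 7), "weekend", "weekday")
# → "weekday" "weekend" "weekend" "weekday"
```

Passing week_start explicitly keeps the numbering fixed even if the lubridate.week.start option has been changed elsewhere.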
Facet on client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
These graphs tell us how registered and casual clients each ride during the week, unlike the previous graphs, which tell us when clients rode more, either on the weekend or on weekdays. I don’t think one graph is better than the other because they both give different information that is helpful.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time_of_day = hour+min,
day_of_week = wday(sdate, label = TRUE)) %>%
mutate(weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "weekend", "weekday")) %>%
# density of time of day, filled by the weekday/weekend label, one panel per client type
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = weekend), color = NA, alpha = .5) +
labs(title = "Density plot of events versus the time of day by client type",
x = "",
y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(client))
Use Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
Trips %>%
mutate(name = sstation) %>%
left_join(Stations,
by = c("name")) %>%
group_by(lat, long) %>%
summarise(total_depart = n()) %>%
ggplot(aes(x = long, y = lat, color = total_depart)) +
labs(title = "Geographic representation of the number of departures from each station",
x = "Longitude",
y = "Latitude") +
geom_point()
Trips %>%
mutate(name = sstation) %>%
left_join(Stations,
by = c("name")) %>%
group_by(lat, long) %>%
summarise(prop_casual = sum(client == "Casual")/n()) %>%
ggplot(aes(x = long, y = lat, color = prop_casual))+
labs(title = "Geographic representation of the proportion of casual departures from each station",
x = "Longitude",
y = "Latitude") +
geom_point()
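The per-station counts and proportions used in these two plots can be checked by hand on a tiny example. The toy_trips tibble below is made up; it only shows that the mean of a logical condition gives the same proportion as dividing the number of matches by n().

```r
library(dplyr)

# Hypothetical departures: station A has 1 casual rider out of 4, station B has 2 out of 2
toy_trips <- tibble(
  sstation = c("A", "A", "A", "A", "B", "B"),
  client   = c("Casual", "Registered", "Registered", "Registered",
               "Casual", "Casual")
)

station_summary <- toy_trips %>%
  group_by(sstation) %>%
  summarise(total_depart = n(),
            prop_casual  = sum(client == "Casual") / n(),
            prop_mean    = mean(client == "Casual"))  # same value, shorter
```

Station A comes out at 0.25 and station B at 1 in both columns.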
as_date(sdate) converts sdate from date-time format to date format.
top_ten_stations <-
Trips %>%
mutate(just_date = as_date(sdate)) %>%
group_by(sstation, just_date) %>%
summarise(num_depart = n()) %>%
arrange(desc(num_depart)) %>%
head(n = 10)
top_ten_stations
Trips %>%
mutate(just_date = as_date(sdate)) %>%
inner_join(top_ten_stations,
by = c("sstation", "just_date"))
Trips %>%
mutate(just_date = as_date(sdate)) %>%
inner_join(top_ten_stations,
by = c("sstation", "just_date")) %>%
mutate(weekday = (wday(sdate, label = TRUE))) %>%
group_by(client, weekday) %>%
summarise(client_depart = n()) %>%
group_by(client) %>%
mutate(prop = client_depart/sum(client_depart)) %>%
pivot_wider(id_cols = weekday,
names_from = client,
values_from = prop)
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.
kids %>%
# keep only library spending, which is what the plot title describes
filter(variable == "lib") %>%
ggplot(aes(x = year, y = inf_adj_perchild)) +
geom_line() +
labs(title = "Change in public spending on libraries",
subtitle = "Dollars spent per child, adjusted for inflation",
x = "",
y = "") +
facet_geo(vars(state), scales = "free")
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?